Granular async instrumentation #687
52dbe88 to b153989
ef07e94 to 0a57afa
    | add an e2e test for this | 
| Add tests in the style of codeflash/tests/test_instrument_tests.py Line 289 in 674e69e 
 | 
| import_transformer = AsyncDecoratorImportAdder(mode) | ||
| module = module.visit(import_transformer) | ||
|  | ||
| return isort.code(module.code, float_to_top=True), decorator_transformer.added_decorator | 
Why are we isorting the user's code?
apply testgen-async
fix bug when iterating over star imports
fix cst * import errors
…687 (`granular-async-instrumentation`)

The optimization replaces expensive `Path` object creation and method calls with direct string manipulation operations, delivering a **491% speedup**.

**Key optimizations:**
1. **Eliminated Path object overhead**: Replaced `Path(filename).stem.startswith("test_")` with `filename.rpartition('/')[-1].rpartition('\\')[-1].rpartition('.')[0].startswith("test_")` - avoiding Path instantiation entirely.
2. **Optimized path parts extraction**: Replaced `Path(filename).parts` with `filename.replace('\\', '/').split('/')` - using simple string operations instead of Path parsing.

**Performance impact analysis:**
- Original profiler shows lines 25-26 (Path operations) consumed **86.3%** of total runtime (44.7% + 41.6%)
- Optimized version reduces these same operations to just **25.4%** of runtime (15% + 10.4%)
- The string manipulation operations are ~6x faster per call than Path object creation

**Test case benefits:**
- **Large-scale tests** see the biggest gains (516% faster for 900-frame stack, 505% faster for 950-frame chain) because the Path overhead multiplies with stack depth
- **Edge cases** with complex paths benefit significantly (182-206% faster for subdirectory and pytest frame tests)
- **Basic tests** show minimal overhead since Path operations weren't the bottleneck in shallow stacks

The optimization maintains identical behavior while eliminating the most expensive operations identified in the profiling data - Path object instantiation and method calls that occurred once per stack frame.
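To make the first replacement concrete, here is a minimal sketch comparing the two expressions quoted in the commit message. The helper names `is_test_file_pathlib` and `is_test_file_str` are hypothetical, and `PurePosixPath` stands in for `Path` so the result does not depend on the host OS. The two forms agree on ordinary `test_*.py` paths, but appear to differ for a file name with no extension, which may be worth checking against the "identical behavior" claim.

```python
from pathlib import PurePosixPath


def is_test_file_pathlib(filename: str) -> bool:
    # Original form quoted in the commit message (PurePosixPath used here for portability).
    return PurePosixPath(filename).stem.startswith("test_")


def is_test_file_str(filename: str) -> bool:
    # String-only form quoted in the commit message: strip directories, then the extension.
    name = filename.rpartition("/")[-1].rpartition("\\")[-1]
    return name.rpartition(".")[0].startswith("test_")


# Both forms agree on typical module paths.
for sample in ("pkg/tests/test_critic.py", "pkg/critic.py", "test_utils.py"):
    assert is_test_file_pathlib(sample) == is_test_file_str(sample)

# Edge case: without a "." in the name, rpartition(".")[0] is the empty string,
# while .stem still returns the full name.
no_extension = "pkg/test_runner"
print(is_test_file_pathlib(no_extension), is_test_file_str(no_extension))
```

Stack frames normally point at `.py` files, so the extension-less case may never occur in practice; the sketch only shows where the two expressions are not strictly equivalent.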
add End to end test for async optimization
Get throughput from output for async functions
| matches_re_end = re.compile(r"!######(.*?):(.*?)([^\.:]*?):(.*?):(.*?):(.*?)######!") | ||
|  | ||
|  | ||
| start_pattern = re.compile(r"!\$######([^:]*):([^:]*):([^:]*):([^:]*):([^:]+)######\$!") | 
Why is this pattern different from the one above? I expected the regexes to be the same.
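One way to see the difference the reviewer is pointing at is to run both patterns over the same payload. The sketch below copies the two regexes from the diff; the payload string is a hypothetical example, not a marker taken from real codeflash output. Under that assumption, the start pattern yields five groups with the class and test name kept together, while the end pattern yields six groups because it splits the class prefix from the test name.

```python
import re

# Patterns copied verbatim from the diff under review.
end_pattern = re.compile(r"!######(.*?):(.*?)([^\.:]*?):(.*?):(.*?):(.*?)######!")
start_pattern = re.compile(r"!\$######([^:]*):([^:]*):([^:]*):([^:]*):([^:]+)######\$!")

# Hypothetical payload: module, Class.test, function under test, loop index, invocation id.
payload = "tests.test_sort:TestSort.test_merge:async_merge_sort:1:0"

start_groups = start_pattern.search(f"!$######{payload}######$!").groups()
end_groups = end_pattern.search(f"!######{payload}######!").groups()

print(len(start_groups), start_groups)
print(len(end_groups), end_groups)
```

Any parser that indexes into these groups therefore needs different offsets for start and end markers, which is the kind of mismatch a shared pattern would avoid.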
|  | ||
| results_list = test_results.test_results | ||
| async_calls = [r for r in results_list if r.id.function_getting_tested == "async_merge_sort"] | ||
| assert len(async_calls) >= 1 | 
This test needs to be improved. Look at all the properties we test in the sync version of this, and test those here as well,
such as the return values and the test id.
This is important to give us confidence that these critical parameters are correct and never broken in the future.
| assert test_results.test_results is not None | ||
| assert len(test_results.test_results) >= 2 | ||
|  | ||
| results_list = test_results.test_results | 
Test for the test_ids and return values exactly.
|  | ||
| assert test_results is not None | ||
| assert test_results.test_results is not None | ||
| assert len(test_results.test_results) >= 2 | 
Use the equality operator in the test; make it precise.
| # Check that comments were added | ||
| modified_source = result.generated_tests[0].generated_original_test_source | ||
| assert modified_source == expected | ||
| assert modified_source == expected | 
Eventually we should also add the throughput annotations when available.
| # Runtime performance evaluation | ||
| noise_floor = 3 * MIN_IMPROVEMENT_THRESHOLD if original_code_runtime < 10000 else MIN_IMPROVEMENT_THRESHOLD | ||
| if not disable_gh_action_noise and env_utils.is_ci(): | ||
| noise_floor = noise_floor * 2 # Increase the noise floor in GitHub Actions mode | ||
|  | ||
| perf_gain = performance_gain( | ||
| original_runtime_ns=original_code_runtime, optimized_runtime_ns=candidate_result.best_test_runtime | ||
| ) | ||
| if best_runtime_until_now is None: | ||
| # collect all optimizations with this | ||
| return bool(perf_gain > noise_floor) | ||
| return bool(perf_gain > noise_floor and candidate_result.best_test_runtime < best_runtime_until_now) | ||
| runtime_improved = perf_gain > noise_floor | ||
|  | ||
| # Check runtime comparison with best so far | ||
| runtime_is_best = best_runtime_until_now is None or candidate_result.best_test_runtime < best_runtime_until_now | ||
|  | ||
| throughput_improved = True # Default to True if no throughput data | ||
| throughput_is_best = True # Default to True if no throughput data | ||
|  | ||
| if original_async_throughput is not None and candidate_result.async_throughput is not None: | ||
| if original_async_throughput > 0: | ||
| throughput_gain_value = throughput_gain( | ||
| original_throughput=original_async_throughput, optimized_throughput=candidate_result.async_throughput | ||
| ) | ||
| throughput_improved = throughput_gain_value > MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD | ||
|  | ||
| throughput_is_best = ( | ||
| best_throughput_until_now is None or candidate_result.async_throughput > best_throughput_until_now | ||
| ) | ||
|  | ||
| if original_async_throughput is not None and candidate_result.async_throughput is not None: | ||
| # When throughput data is available, accept if EITHER throughput OR runtime improves significantly | ||
| throughput_acceptance = throughput_improved and throughput_is_best | ||
| runtime_acceptance = runtime_improved and runtime_is_best | ||
| return throughput_acceptance or runtime_acceptance | 
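The acceptance rule in the new lines of this diff can be restated as a small standalone sketch. It is a simplified reading, not the project's implementation. The function name `accept_candidate`, the 0.01 thresholds, and the zero guards in the gain helpers are assumptions for illustration, the CI noise-floor doubling is left out, and the runtime-only fallback when throughput data is missing is assumed because the diff is cut off before that branch.

```python
from __future__ import annotations

# Assumed threshold values, for illustration only.
MIN_IMPROVEMENT_THRESHOLD = 0.01
MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD = 0.01


def performance_gain(original_runtime_ns: int, optimized_runtime_ns: int) -> float:
    if optimized_runtime_ns == 0:  # assumed guard against division by zero
        return 0.0
    return (original_runtime_ns - optimized_runtime_ns) / optimized_runtime_ns


def throughput_gain(original_throughput: int, optimized_throughput: int) -> float:
    if original_throughput == 0:  # assumed guard against division by zero
        return 0.0
    return (optimized_throughput - original_throughput) / original_throughput


def accept_candidate(
    original_runtime_ns: int,
    candidate_runtime_ns: int,
    best_runtime_until_now: int | None = None,
    original_throughput: int | None = None,
    candidate_throughput: int | None = None,
    best_throughput_until_now: int | None = None,
) -> bool:
    # Short-running code is noisier, so the diff triples the floor below 10 microseconds.
    noise_floor = 3 * MIN_IMPROVEMENT_THRESHOLD if original_runtime_ns < 10_000 else MIN_IMPROVEMENT_THRESHOLD
    runtime_improved = performance_gain(original_runtime_ns, candidate_runtime_ns) > noise_floor
    runtime_is_best = best_runtime_until_now is None or candidate_runtime_ns < best_runtime_until_now

    if original_throughput is None or candidate_throughput is None:
        # Assumed fallback: judge on runtime alone when throughput data is missing.
        return runtime_improved and runtime_is_best

    throughput_improved = True  # default kept from the diff, also used when original throughput is 0
    if original_throughput > 0:
        gain = throughput_gain(original_throughput, candidate_throughput)
        throughput_improved = gain > MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD
    throughput_is_best = best_throughput_until_now is None or candidate_throughput > best_throughput_until_now

    # Either metric may carry the decision, but each must also beat its best-so-far value.
    return (throughput_improved and throughput_is_best) or (runtime_improved and runtime_is_best)


print(accept_candidate(100_000, 90_000))
print(accept_candidate(100_000, 100_000, original_throughput=1000, candidate_throughput=1100))
```

One thing the sketch makes visible: when the original throughput is 0, `throughput_improved` keeps its default of `True`, so such a candidate is accepted on throughput unless a better throughput has already been recorded.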
⚡️Codeflash found 16% (0.16x) speedup for speedup_critic in codeflash/result/critic.py
⏱️ Runtime : 3.15 milliseconds → 2.71 milliseconds (best of 98 runs)
📝 Explanation and details
The optimized code achieves a 16% speedup by eliminating function call overhead and streamlining conditional logic in the performance-critical speedup_critic function.
Key optimizations:
- Inlined performance calculation: Instead of calling performance_gain(), the performance gain is calculated directly inline as (original_code_runtime - candidate_result.best_test_runtime) / candidate_result.best_test_runtime. This eliminates function call overhead, which was consuming 35.2% of the original execution time according to the line profiler.
- Inlined throughput calculation: Similarly, the throughput gain calculation is moved inline as (candidate_result.async_throughput - original_async_throughput) / original_async_throughput, removing another function call.
- Streamlined conditional structure: The throughput evaluation logic is reorganized to eliminate redundant variable assignments and combine the final decision logic more efficiently. The original code had separate variables for throughput_acceptance and runtime_acceptance, while the optimized version directly returns the combined condition.
- Reduced variable assignments: Eliminated unnecessary intermediate variables like the throughput_improved = True defaults, handling the logic more directly within the conditional branches.
The line profiler shows the original performance_gain and throughput_gain function calls took significant time (25.8ms and 9.9ms respectively out of 73.3ms total). By inlining these simple calculations, the optimized version reduces total execution time to 31.0ms.
These optimizations are particularly effective for high-volume scenarios where speedup_critic is called frequently, as evidenced by the large-scale test cases showing consistent 13-20% improvements when processing hundreds or thousands of candidates.
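For readers who want to reproduce the call-overhead effect described above, the following is a rough sketch using `timeit`. The wrapper names `gain_via_call` and `gain_inlined` and the 0.01 threshold are hypothetical, and the helper omits any zero guard the real `performance_gain` may have. Timings vary by machine and Python version, so the script prints them without asserting a particular speedup.

```python
import timeit


def performance_gain(original_runtime_ns: int, optimized_runtime_ns: int) -> float:
    # Simplified helper using the formula quoted in the explanation (no zero guard).
    return (original_runtime_ns - optimized_runtime_ns) / optimized_runtime_ns


def gain_via_call(orig: int, opt: int) -> bool:
    # Goes through the helper, paying one extra function call per check.
    return performance_gain(orig, opt) > 0.01


def gain_inlined(orig: int, opt: int) -> bool:
    # Same arithmetic written inline, as the optimized version is described to do.
    return (orig - opt) / opt > 0.01


# Both variants must agree before timing them means anything.
assert gain_via_call(100_000, 90_000) == gain_inlined(100_000, 90_000)

if __name__ == "__main__":
    for fn in (gain_via_call, gain_inlined):
        elapsed = timeit.timeit(lambda: fn(100_000, 90_000), number=200_000)
        print(f"{fn.__name__}: {elapsed:.4f}s for 200,000 calls")
```

Inlining trades a small amount of readability and reuse for fewer calls, so it tends to be worthwhile only when the function sits on a hot path.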
✅ Correctness verification report:
| Test | Status | 
|---|---|
| ⚙️ Existing Unit Tests | ✅ 14 Passed | 
| 🌀 Generated Regression Tests | ✅ 5050 Passed | 
| ⏪ Replay Tests | 🔘 None Found | 
| 🔎 Concolic Coverage Tests | 🔘 None Found | 
| 📊 Tests Coverage | 100.0% | 
⚙️ Existing Unit Tests and Runtime
| Test File::Test Function | Original ⏱️ | Optimized ⏱️ | Speedup | 
|---|---|---|---|
| test_critic.py::test_speedup_critic | 4.67μs | 4.26μs | 9.65%✅ | 
| test_critic.py::test_speedup_critic_with_async_throughput | 9.48μs | 8.40μs | 12.7%✅ | 
🌀 Generated Regression Tests and Runtime
from __future__ import annotations
import os
from dataclasses import dataclass
from functools import lru_cache
# imports
import pytest  # used for our unit tests
from codeflash.result.critic import speedup_critic
# Simulate codeflash.code_utils.config_consts
MIN_IMPROVEMENT_THRESHOLD = 0.01  # 1%
MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD = 0.01  # 1%
# Simulate codeflash.models.models.OptimizedCandidateResult
@dataclass
class OptimizedCandidateResult:
    best_test_runtime: int
    async_throughput: int | None = None
# ---- BASIC TEST CASES ----
def test_basic_runtime_improvement_above_threshold():
    # Test: optimized code is 10% faster, above 1% threshold
    orig = 100_000  # ns
    opt = 90_000    # ns (10% faster)
    cand = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(cand, orig, None) # 2.13μs -> 2.06μs (3.39% faster)
def test_basic_runtime_improvement_below_threshold():
    # Test: optimized code is 0.5% faster, below 1% threshold
    orig = 100_000
    opt = 99_500  # 0.5% faster
    cand = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(cand, orig, None) # 2.09μs -> 1.88μs (11.2% faster)
def test_basic_runtime_no_improvement():
    # Test: optimized code is slower
    orig = 100_000
    opt = 110_000
    cand = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(cand, orig, None) # 1.98μs -> 1.85μs (7.07% faster)
def test_basic_runtime_improvement_and_best_so_far():
    # Test: improvement and better than previous best
    orig = 100_000
    opt = 90_000
    prev_best = 91_000
    cand = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(cand, orig, prev_best) # 2.11μs -> 1.98μs (6.50% faster)
def test_basic_runtime_improvement_not_best_so_far():
    # Test: improvement but not better than previous best
    orig = 100_000
    opt = 92_000
    prev_best = 90_000
    cand = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(cand, orig, prev_best) # 2.03μs -> 1.91μs (6.33% faster)
def test_basic_throughput_improvement_above_threshold():
    # Test: throughput improvement above threshold, no runtime improvement
    orig = 100_000
    opt = 100_000
    orig_through = 1000
    opt_through = 1100  # 10% improvement
    cand = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_through)
    codeflash_output = speedup_critic(cand, orig, None, original_async_throughput=orig_through) # 3.06μs -> 2.65μs (15.1% faster)
def test_basic_throughput_improvement_below_threshold():
    # Test: throughput improvement below threshold, no runtime improvement
    orig = 100_000
    opt = 100_000
    orig_through = 1000
    opt_through = 1005  # 0.5% improvement
    cand = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_through)
    codeflash_output = speedup_critic(cand, orig, None, original_async_throughput=orig_through) # 2.71μs -> 2.37μs (14.4% faster)
def test_basic_throughput_and_runtime_improvement_either_suffices():
    # Test: throughput improved, runtime not; should accept
    orig = 100_000
    opt = 100_000
    orig_through = 1000
    opt_through = 1100
    cand = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_through)
    codeflash_output = speedup_critic(cand, orig, None, original_async_throughput=orig_through) # 2.65μs -> 2.35μs (12.3% faster)
    # Test: runtime improved, throughput not; should accept
    orig = 100_000
    opt = 90_000
    orig_through = 1000
    opt_through = 1000
    cand = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_through)
    codeflash_output = speedup_critic(cand, orig, None, original_async_throughput=orig_through) # 1.63μs -> 1.38μs (18.2% faster)
def test_basic_throughput_best_so_far():
    # Test: throughput improved but not best so far
    orig = 100_000
    opt = 100_000
    orig_through = 1000
    opt_through = 1100
    prev_best_through = 1200
    cand = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_through)
    codeflash_output = speedup_critic(
        cand, orig, None, original_async_throughput=orig_through, best_throughput_until_now=prev_best_through
    ) # 2.90μs -> 2.50μs (16.1% faster)
def test_basic_runtime_and_throughput_best_so_far():
    # Test: runtime and throughput both improved and both best so far
    orig = 100_000
    opt = 90_000
    orig_through = 1000
    opt_through = 1100
    prev_best = 91_000
    prev_best_through = 1050
    cand = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_through)
    codeflash_output = speedup_critic(
        cand, orig, prev_best, original_async_throughput=orig_through, best_throughput_until_now=prev_best_through
    ) # 3.01μs -> 2.62μs (15.0% faster)
# ---- EDGE TEST CASES ----
def test_edge_runtime_exactly_at_threshold():
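    # Test: improvement nominally at the threshold; acceptance requires gain > threshold.
    # Note: int() truncates 99_009.9 to 99_009 ns, so the computed gain is ~1.0009%, marginally above 1%.
    orig = 100_000
    opt = int(orig / (1 + MIN_IMPROVEMENT_THRESHOLD))  # roughly 1% faster after truncation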
    cand = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(cand, orig, None) # 1.89μs -> 1.73μs (9.17% faster)
def test_edge_throughput_exactly_at_threshold():
    # Test: throughput improvement exactly at threshold (should be False)
    orig = 100_000
    opt = 100_000
    orig_through = 1000
    opt_through = int(orig_through * (1 + MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD))
    cand = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_through)
    codeflash_output = speedup_critic(cand, orig, None, original_async_throughput=orig_through) # 2.77μs -> 2.30μs (20.0% faster)
def test_edge_runtime_below_10us_noise_floor():
    # Test: original runtime below 10us, noise floor is 3x
    orig = 9000  # 9us
    opt = int(orig / (1 + 3 * MIN_IMPROVEMENT_THRESHOLD)) - 1  # Just above noise floor
    cand = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(cand, orig, None) # 1.91μs -> 1.81μs (5.57% faster)
    # Now just below noise floor
    opt = int(orig / (1 + 3 * MIN_IMPROVEMENT_THRESHOLD)) + 1
    cand = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(cand, orig, None) # 872ns -> 761ns (14.6% faster)
def test_edge_runtime_zero_optimized_runtime():
    # Test: optimized_runtime_ns == 0 (should return False, as gain is 0)
    orig = 100_000
    opt = 0
    cand = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(cand, orig, None) # 2.13μs -> 1.73μs (23.1% faster)
def test_edge_throughput_zero_original_throughput():
    # Test: original throughput is zero (should not crash, gain is 0)
    orig = 100_000
    opt = 100_000
    orig_through = 0
    opt_through = 1000
    cand = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_through)
    codeflash_output = speedup_critic(cand, orig, None, original_async_throughput=orig_through) # 2.77μs -> 2.50μs (10.4% faster)
def test_edge_throughput_none_values():
    # Test: async_throughput is None in candidate (should default to runtime logic)
    orig = 100_000
    opt = 90_000
    cand = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=None)
    codeflash_output = speedup_critic(cand, orig, None, original_async_throughput=1000) # 2.54μs -> 2.20μs (15.0% faster)
    # Test: original_async_throughput is None (should default to runtime logic)
    cand = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=1100)
    codeflash_output = speedup_critic(cand, orig, None, original_async_throughput=None) # 1.10μs -> 972ns (13.4% faster)
def test_large_scale_many_candidates_runtime(monkeypatch):
    # Test: 1000 candidates, only one is best and above threshold
    orig = 100_000
    prev_best = 90_000
    # All candidates are worse than prev_best except one
    results = [OptimizedCandidateResult(best_test_runtime=orig - i) for i in range(1000)]
    # Only candidate at index 950 is below prev_best and above threshold
    results[950] = OptimizedCandidateResult(best_test_runtime=85_000)
    count = 0
    for cand in results:
        if speedup_critic(cand, orig, prev_best):
            count += 1
def test_large_scale_throughput(monkeypatch):
    # Test: 500 candidates with steadily increasing throughput, checked against a previous best
    orig = 100_000
    orig_through = 1000
    prev_best_through = 1100
    results = []
    for i in range(500):
        # Throughput rises from 1000 to 1499; entries with i > 100 already exceed prev_best_through
        results.append(OptimizedCandidateResult(best_test_runtime=orig, async_throughput=orig_through + i))
    # Index 499 is overwritten with 1200, which is also above prev_best_through and the threshold
    results[499] = OptimizedCandidateResult(best_test_runtime=orig, async_throughput=1200)
    count = 0
    for cand in results:
        if speedup_critic(
            cand, orig, None, original_async_throughput=orig_through, best_throughput_until_now=prev_best_through
        ):
            count += 1
def test_large_scale_runtime_and_throughput_combined():
    # Test: 100 candidates, some with runtime improvement, some with throughput, some both
    orig = 100_000
    orig_through = 1000
    prev_best = 95_000
    prev_best_through = 1050
    results = []
    # 10 with runtime improvement and best so far
    for i in range(10):
        results.append(OptimizedCandidateResult(best_test_runtime=90_000 - i, async_throughput=1000))
    # 10 with throughput improvement and best so far
    for i in range(10):
        results.append(OptimizedCandidateResult(best_test_runtime=100_000, async_throughput=1100 + i))
    # 10 with both
    for i in range(10):
        results.append(OptimizedCandidateResult(best_test_runtime=90_000 - i, async_throughput=1100 + i))
    # The rest with no improvement
    for i in range(70):
        results.append(OptimizedCandidateResult(best_test_runtime=100_000, async_throughput=1000))
    count = 0
    for cand in results:
        if speedup_critic(
            cand, orig, prev_best, original_async_throughput=orig_through, best_throughput_until_now=prev_best_through
        ):
            count += 1
def test_large_scale_edge_case_all_equal():
    # Test: all candidates have identical performance, none should pass
    orig = 100_000
    prev_best = 90_000
    orig_through = 1000
    prev_best_through = 1100
    results = [OptimizedCandidateResult(best_test_runtime=100_000, async_throughput=1000) for _ in range(500)]
    for cand in results:
        codeflash_output = speedup_critic(
            cand, orig, prev_best, original_async_throughput=orig_through, best_throughput_until_now=prev_best_through
        ) # 385μs -> 321μs (19.9% faster)
def test_large_scale_edge_case_all_best():
    # Test: all candidates are best and above threshold, all should pass
    orig = 100_000
    prev_best = 110_000
    orig_through = 1000
    prev_best_through = 900
    results = [OptimizedCandidateResult(best_test_runtime=90_000, async_throughput=1200) for _ in range(500)]
    for cand in results:
        codeflash_output = speedup_critic(
            cand, orig, prev_best, original_async_throughput=orig_through, best_throughput_until_now=prev_best_through
        ) # 384μs -> 321μs (19.5% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
#------------------------------------------------
from __future__ import annotations
import os
from dataclasses import dataclass
from functools import lru_cache
# imports
import pytest  # used for our unit tests
from codeflash.result.critic import speedup_critic
# --- Minimal stubs for external dependencies/constants ---
# These are required for the function to run in this test file.
MIN_IMPROVEMENT_THRESHOLD = 0.01  # 1%
MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD = 0.01  # 1%
@dataclass
class OptimizedCandidateResult:
    best_test_runtime: int  # in nanoseconds
    async_throughput: int | None = None
# ----------- 1. BASIC TEST CASES -----------
def test_basic_significant_runtime_improvement():
    # Optimized code is 20% faster than original, above threshold
    orig = 100_000  # ns
    opt = 80_000    # ns (20% faster)
    candidate = OptimizedCandidateResult(best_test_runtime=opt)
    # No previous best
    codeflash_output = speedup_critic(candidate, orig, None) # 2.42μs -> 2.11μs (14.7% faster)
def test_basic_insufficient_runtime_improvement():
    # Only 0.5% improvement, below threshold
    orig = 100_000
    opt = 99_500  # 0.5% improvement
    candidate = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(candidate, orig, None) # 2.08μs -> 1.94μs (7.15% faster)
def test_basic_no_improvement():
    # Optimized code is slower
    orig = 100_000
    opt = 120_000
    candidate = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(candidate, orig, None) # 2.00μs -> 1.89μs (5.81% faster)
def test_basic_best_runtime_until_now():
    # There is a previous best, and this candidate is not better
    orig = 100_000
    opt = 80_000
    candidate = OptimizedCandidateResult(best_test_runtime=opt)
    best_so_far = 75_000  # Already have a better one
    codeflash_output = speedup_critic(candidate, orig, best_so_far) # 2.13μs -> 1.92μs (11.0% faster)
def test_basic_new_best_runtime():
    # There is a previous best, and this candidate is better
    orig = 100_000
    opt = 70_000
    candidate = OptimizedCandidateResult(best_test_runtime=opt)
    best_so_far = 75_000
    codeflash_output = speedup_critic(candidate, orig, best_so_far) # 1.94μs -> 1.93μs (0.517% faster)
def test_basic_throughput_improvement_accepts():
    # Throughput improvement is significant, runtime is not
    orig = 100_000
    opt = 99_500
    orig_thr = 1000
    opt_thr = 1100  # 10% improvement
    candidate = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_thr)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr) # 3.14μs -> 2.73μs (14.7% faster)
def test_basic_throughput_no_improvement_rejects():
    # Throughput improvement is below threshold, runtime is not improved
    orig = 100_000
    opt = 99_500
    orig_thr = 1000
    opt_thr = 1005  # 0.5% improvement
    candidate = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_thr)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr) # 2.71μs -> 2.42μs (12.0% faster)
def test_basic_throughput_and_runtime_both_improve():
    # Both throughput and runtime improve significantly
    orig = 100_000
    opt = 80_000
    orig_thr = 1000
    opt_thr = 1200
    candidate = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_thr)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr) # 2.74μs -> 2.32μs (17.7% faster)
def test_basic_throughput_is_best_check():
    # Throughput improvement is significant but not best so far, should reject
    orig = 100_000
    opt = 99_000
    orig_thr = 1000
    opt_thr = 1100
    best_thr = 1200  # Already have a better throughput
    candidate = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_thr)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr, best_throughput_until_now=best_thr) # 3.00μs -> 2.62μs (14.1% faster)
def test_basic_throughput_best_but_runtime_not_best():
    # Throughput is new best, runtime is not best but throughput is enough
    orig = 100_000
    opt = 99_000
    orig_thr = 1000
    opt_thr = 1300
    best_thr = 1200
    candidate = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_thr)
    codeflash_output = speedup_critic(candidate, orig, 98_000, original_async_throughput=orig_thr, best_throughput_until_now=best_thr) # 2.97μs -> 2.73μs (8.80% faster)
# ----------- 2. EDGE TEST CASES -----------
def test_edge_runtime_just_below_threshold():
    # Improvement is just below the threshold (should reject)
    orig = 100_000
    opt = int(orig / (1 + MIN_IMPROVEMENT_THRESHOLD - 0.0001))  # Just under threshold
    candidate = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(candidate, orig, None) # 1.94μs -> 1.78μs (8.97% faster)
def test_edge_runtime_just_above_threshold():
    # Improvement is just above the threshold (should accept)
    orig = 100_000
    opt = int(orig / (1 + MIN_IMPROVEMENT_THRESHOLD + 0.0001))  # Just over threshold
    candidate = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(candidate, orig, None) # 1.74μs -> 1.57μs (10.9% faster)
def test_edge_small_original_runtime_noise_floor(monkeypatch):
    # For original_code_runtime < 10_000, noise floor is 3x threshold
    orig = 9000
    opt = int(orig / (1 + 3 * MIN_IMPROVEMENT_THRESHOLD + 0.0001))  # Just over noise floor
    candidate = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(candidate, orig, None) # 1.94μs -> 1.84μs (5.37% faster)
def test_edge_small_original_runtime_just_below_noise_floor(monkeypatch):
    # For original_code_runtime < 10_000, just below noise floor
    orig = 9000
    opt = int(orig / (1 + 3 * MIN_IMPROVEMENT_THRESHOLD - 0.0001))  # Just under noise floor
    candidate = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(candidate, orig, None) # 1.78μs -> 1.77μs (0.564% faster)
def test_edge_zero_optimized_runtime():
    # Optimized runtime is zero (should not crash, should return False)
    orig = 100_000
    opt = 0
    candidate = OptimizedCandidateResult(best_test_runtime=opt)
    codeflash_output = speedup_critic(candidate, orig, None) # 2.29μs -> 1.80μs (27.2% faster)
def test_edge_zero_original_throughput():
    # Original throughput is zero, should not crash, throughput gain is 0
    orig = 100_000
    opt = 99_000
    orig_thr = 0
    opt_thr = 1000
    candidate = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_thr)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr) # 2.84μs -> 2.60μs (9.29% faster)
def test_edge_none_throughput_values():
    # Throughput values are None, should fallback to runtime only
    orig = 100_000
    opt = 80_000
    candidate = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=None)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=None) # 2.34μs -> 2.19μs (6.89% faster)
def test_edge_none_candidate_throughput():
    # Candidate throughput is None, should fallback to runtime only
    orig = 100_000
    opt = 80_000
    candidate = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=None)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=1000) # 2.31μs -> 2.20μs (4.99% faster)
def test_edge_none_original_throughput():
    # Original throughput is None, should fallback to runtime only
    orig = 100_000
    opt = 80_000
    candidate = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=1200)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=None) # 2.16μs -> 2.01μs (7.45% faster)
def test_edge_throughput_and_runtime_both_worse():
    # Both throughput and runtime are worse, must reject
    orig = 100_000
    opt = 120_000
    orig_thr = 1000
    opt_thr = 900
    candidate = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_thr)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr) # 2.96μs -> 2.54μs (16.1% faster)
def test_edge_throughput_better_runtime_worse():
    # Throughput is significantly better, runtime is worse, should accept
    orig = 100_000
    opt = 120_000
    orig_thr = 1000
    opt_thr = 1100
    candidate = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_thr)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr) # 2.79μs -> 2.42μs (14.8% faster)
def test_edge_throughput_better_but_not_best():
    # Throughput is improved but not the best so far, should reject
    orig = 100_000
    opt = 120_000
    orig_thr = 1000
    opt_thr = 1100
    best_thr = 1200
    candidate = OptimizedCandidateResult(best_test_runtime=opt, async_throughput=opt_thr)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr, best_throughput_until_now=best_thr) # 3.04μs -> 2.59μs (17.4% faster)
# ----------- 3. LARGE SCALE TEST CASES -----------
def test_large_scale_many_candidates_runtime(monkeypatch):
    # Test with a large number of candidates, only the best one should be accepted
    orig = 1_000_000
    best_so_far = 800_000
    # Try a batch of 1000 candidates with slightly worse runtimes
    for i in range(1000):
        candidate = OptimizedCandidateResult(best_test_runtime=best_so_far + i + 1)
        codeflash_output = speedup_critic(candidate, orig, best_so_far) # 510μs -> 451μs (13.0% faster)
    # Now test with a better candidate
    candidate = OptimizedCandidateResult(best_test_runtime=750_000)
    codeflash_output = speedup_critic(candidate, orig, best_so_far) # 601ns -> 501ns (20.0% faster)
def test_large_scale_throughput_candidates():
    # Test with many throughput candidates, only the best and improved one should be accepted
    orig = 1_000_000
    orig_thr = 10_000
    best_thr = 12_000
    # All these are not best so should be rejected
    for i in range(1000):
        candidate = OptimizedCandidateResult(best_test_runtime=900_000, async_throughput=best_thr - 1 - i)
        codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr, best_throughput_until_now=best_thr) # 783μs -> 653μs (20.0% faster)
    # Now test with a new best throughput
    candidate = OptimizedCandidateResult(best_test_runtime=900_000, async_throughput=13_000)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr, best_throughput_until_now=best_thr) # 872ns -> 721ns (20.9% faster)
def test_large_scale_performance(monkeypatch):
    # Simulate a large batch of candidates, all with significant improvement
    orig = 1_000_000
    for i in range(1000):
        candidate = OptimizedCandidateResult(best_test_runtime=orig - 20_000 - i)
        codeflash_output = speedup_critic(candidate, orig, None) # 486μs -> 427μs (13.7% faster)
def test_large_scale_edge_thresholds():
    # Many candidates, some just above and some just below threshold
    orig = 1_000_000
    # Just below threshold
    opt = int(orig / (1 + MIN_IMPROVEMENT_THRESHOLD - 0.0001))
    for _ in range(500):
        candidate = OptimizedCandidateResult(best_test_runtime=opt)
        codeflash_output = speedup_critic(candidate, orig, None) # 244μs -> 215μs (13.4% faster)
    # Just above threshold
    opt = int(orig / (1 + MIN_IMPROVEMENT_THRESHOLD + 0.0001))
    for _ in range(500):
        candidate = OptimizedCandidateResult(best_test_runtime=opt)
        codeflash_output = speedup_critic(candidate, orig, None) # 241μs -> 212μs (13.7% faster)
def test_large_scale_throughput_and_runtime_mixed():
    # Many candidates, some with only throughput improvement, some with only runtime, some with both, some with neither
    orig = 1_000_000
    orig_thr = 10_000
    # Only throughput improves
    candidate = OptimizedCandidateResult(best_test_runtime=995_000, async_throughput=11_000)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr) # 3.04μs -> 2.58μs (17.4% faster)
    # Only runtime improves
    candidate = OptimizedCandidateResult(best_test_runtime=900_000, async_throughput=10_000)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr) # 1.45μs -> 1.24μs (16.8% faster)
    # Both improve
    candidate = OptimizedCandidateResult(best_test_runtime=900_000, async_throughput=11_000)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr) # 1.01μs -> 832ns (21.6% faster)
    # Neither improves
    candidate = OptimizedCandidateResult(best_test_runtime=1_010_000, async_throughput=9_000)
    codeflash_output = speedup_critic(candidate, orig, None, original_async_throughput=orig_thr) # 922ns -> 802ns (15.0% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
To test or edit this optimization locally `git merge codeflash/optimize-pr687-2025-09-24T19.11.26`
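The two boundary runtimes in test_large_scale_edge_thresholds can be checked by hand. The following is a minimal sketch, assuming `MIN_IMPROVEMENT_THRESHOLD` is 0.05 (the value is not shown in this thread) and that the gain is computed as (original - optimized) / optimized, as in the suggested change. `relative_gain` is a hypothetical helper name used only for illustration.

```python
# Assumed value for illustration; the real constant is defined in codeflash.
MIN_IMPROVEMENT_THRESHOLD = 0.05


def relative_gain(original_ns: int, optimized_ns: int) -> float:
    """Hypothetical helper: speedup measured relative to the optimized runtime."""
    if optimized_ns == 0:
        return 0.0
    return (original_ns - optimized_ns) / optimized_ns


orig = 1_000_000
# Optimized runtime chosen so the gain lands just under the threshold.
just_below = int(orig / (1 + MIN_IMPROVEMENT_THRESHOLD - 0.0001))
# Optimized runtime chosen so the gain lands just over the threshold.
just_above = int(orig / (1 + MIN_IMPROVEMENT_THRESHOLD + 0.0001))

below_passes = relative_gain(orig, just_below) > MIN_IMPROVEMENT_THRESHOLD
above_passes = relative_gain(orig, just_above) > MIN_IMPROVEMENT_THRESHOLD
print(just_below, below_passes)
print(just_above, above_passes)
```

Under these assumptions only the second runtime clears the noise floor. When the critic runs in CI without `disable_gh_action_noise`, the noise floor is doubled, so a gain this close to the threshold would not pass there either.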
Suggested changes
Existing code:
    # Runtime performance evaluation
    noise_floor = 3 * MIN_IMPROVEMENT_THRESHOLD if original_code_runtime < 10000 else MIN_IMPROVEMENT_THRESHOLD
    if not disable_gh_action_noise and env_utils.is_ci():
        noise_floor = noise_floor * 2 # Increase the noise floor in GitHub Actions mode
    perf_gain = performance_gain(
        original_runtime_ns=original_code_runtime, optimized_runtime_ns=candidate_result.best_test_runtime
    )
    if best_runtime_until_now is None:
        # collect all optimizations with this
        return bool(perf_gain > noise_floor)
    return bool(perf_gain > noise_floor and candidate_result.best_test_runtime < best_runtime_until_now)
    runtime_improved = perf_gain > noise_floor
    # Check runtime comparison with best so far
    runtime_is_best = best_runtime_until_now is None or candidate_result.best_test_runtime < best_runtime_until_now
    throughput_improved = True # Default to True if no throughput data
    throughput_is_best = True # Default to True if no throughput data
    if original_async_throughput is not None and candidate_result.async_throughput is not None:
        if original_async_throughput > 0:
            throughput_gain_value = throughput_gain(
                original_throughput=original_async_throughput, optimized_throughput=candidate_result.async_throughput
            )
            throughput_improved = throughput_gain_value > MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD
        throughput_is_best = (
            best_throughput_until_now is None or candidate_result.async_throughput > best_throughput_until_now
        )
    if original_async_throughput is not None and candidate_result.async_throughput is not None:
        # When throughput data is available, accept if EITHER throughput OR runtime improves significantly
        throughput_acceptance = throughput_improved and throughput_is_best
        runtime_acceptance = runtime_improved and runtime_is_best
        return throughput_acceptance or runtime_acceptance
Suggested replacement:
    noise_floor = 3 * MIN_IMPROVEMENT_THRESHOLD if original_code_runtime < 10000 else MIN_IMPROVEMENT_THRESHOLD
    if not disable_gh_action_noise and env_utils.is_ci():
        noise_floor = noise_floor * 2 # Increase the noise floor in GitHub Actions mode
    perf_gain = (
        (original_code_runtime - candidate_result.best_test_runtime) / candidate_result.best_test_runtime
        if candidate_result.best_test_runtime != 0
        else 0.0
    )
    runtime_improved = perf_gain > noise_floor
    runtime_is_best = best_runtime_until_now is None or candidate_result.best_test_runtime < best_runtime_until_now
    # Combine throughput logic for tighter critical-path performance
    if original_async_throughput is not None and candidate_result.async_throughput is not None:
        if original_async_throughput > 0:
            throughput_gain_value = (
                candidate_result.async_throughput - original_async_throughput
            ) / original_async_throughput
            throughput_improved = throughput_gain_value > MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD
        else:
            throughput_improved = True
        throughput_is_best = (
            best_throughput_until_now is None or candidate_result.async_throughput > best_throughput_until_now
        )
        # Accept if either throughput or runtime improvement is good and is best so far
        return (throughput_improved and throughput_is_best) or (runtime_improved and runtime_is_best)
    # No async throughput measured: fallback to only runtime logic
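The acceptance rule in the suggested change can be sketched as a standalone function. This is an illustrative sketch, not the codeflash implementation: the two threshold values, the `Candidate` dataclass, and the `accepts` function are assumptions made for the example, and the CI noise adjustment is left out. The runtime-only fallback follows the intent of the final comment in the suggestion, whose return statement is not shown in this thread.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed thresholds for illustration only; the real values are not shown here.
MIN_IMPROVEMENT_THRESHOLD = 0.05
MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD = 0.10


@dataclass
class Candidate:
    """Hypothetical stand-in for OptimizedCandidateResult."""

    best_test_runtime: int
    async_throughput: int | None = None


def accepts(
    candidate: Candidate,
    original_runtime: int,
    best_runtime: int | None = None,
    *,
    original_throughput: int | None = None,
    best_throughput: int | None = None,
) -> bool:
    # Very short runtimes are noisier, so they get a higher noise floor.
    noise_floor = MIN_IMPROVEMENT_THRESHOLD * (3 if original_runtime < 10_000 else 1)
    runtime = candidate.best_test_runtime
    perf_gain = (original_runtime - runtime) / runtime if runtime != 0 else 0.0
    runtime_ok = perf_gain > noise_floor and (best_runtime is None or runtime < best_runtime)

    # Without throughput data on both sides, only the runtime decides.
    if original_throughput is None or candidate.async_throughput is None:
        return runtime_ok

    if original_throughput > 0:
        gain = (candidate.async_throughput - original_throughput) / original_throughput
        throughput_improved = gain > MIN_THROUGHPUT_IMPROVEMENT_THRESHOLD
    else:
        throughput_improved = True
    throughput_is_best = best_throughput is None or candidate.async_throughput > best_throughput

    # Either criterion is enough, as long as it is also the best seen so far.
    return (throughput_improved and throughput_is_best) or runtime_ok


# Throughput up 20%, runtime essentially flat: accepted on throughput alone.
print(accepts(Candidate(995_000, 12_000), 1_000_000, original_throughput=10_000))
# Runtime slower and throughput lower: rejected.
print(accepts(Candidate(1_010_000, 9_000), 1_000_000, original_throughput=10_000))
```

Because both comparisons are strict, a gain exactly equal to its threshold does not qualify. Under the assumed 10% throughput threshold, for example, a move from 10_000 to 11_000 would not count as improved on its own.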
User description
dependent on #678
PR Type
Enhancement, Tests, Bug fix
Description
Add async function instrumentation and throughput
Update optimizer for async benchmarking
Extend AST utilities to handle async defs
Add comprehensive async test suites
Diagram Walkthrough
File Walkthrough
4 files
End-to-end async run_and_parse_tests scenarios
Async decorator injection and test instrumentation
Unused helper revert with async scenarios
Validate async wrapper SQLite capture and perf
2 files
Async-aware baseline, benchmarking, and throughput
Support AsyncFunctionDef and async test pruning
32 files